Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
Authors
Abstract
Identifying and classifying personal, geographic, institutional or other names in a text is an important task for numerous applications. This paper describes and evaluates a language-independent bootstrapping algorithm based on iterative learning and re-estimation of contextual and morphological patterns captured in hierarchically smoothed trie models. The algorithm learns from unannotated text and achieves competitive performance when trained on a very short labelled name list with no other required language-specific information, tokenizers or tools.
1 Introduction
The ability to determine the named entities in a text has been established as an important task for several natural language processing areas, including information retrieval, machine translation, information extraction and language understanding. For the 1995 Message Understanding Conference (MUC-6), a separate named entity recognition task was developed and the best systems achieved impressive accuracy (with an F-measure approaching 95%). It should be underlined, however, that these systems were trained for a specific domain and a particular language (English), typically making use of hand-coded rules, taggers, parsers and semantic lexicons. Indeed, most published named entity recognizers either use tagged text, perform syntactic and morphological analysis, or use semantic information for contextual clues. Even the systems that do not make use of extensive knowledge about a particular language, such as Nominator (Choi et al., 1997), still typically use large data files containing lists of names, exceptions, and personal and organizational identifiers. Our aim has been to build a maximally language-independent system for both named-entity identification and classification, using minimal information about the source language.
The applicability of AI-style algorithms and supervised methods is limited in the multilingual case because of the cost of knowledge databases and manually annotated corpora. Therefore, a much more suitable approach is to consider an EM-style bootstrapping algorithm. In terms of world knowledge, the simplest and most relevant resource for this task is a database of known names. For each entity class to be recognized and tagged, it is assumed that the user can provide a short list (on the order of one hundred) of unambiguous examples (seeds). Of course, the more examples provided, the better the results, but what we try to show is that good results can be achieved even with minimal knowledge. Additionally, some basic particularities of the language should be known: capitalization (if it exists and is relevant; some languages do not make use of capitalization, and in others, such as German, capitalization is not of great help), allowable word separators (if they exist), and a few frequent exceptions (like the pronoun "I" in English). Although such information can be utilised if present, it is not required, and no other assumptions are made in the general model.
1.1 Word-Internal and Contextual Information
The algorithm relies on both word-internal and contextual clues as relatively independent evidence sources that drive the bootstrapping algorithm. The first category refers to the morphological structure of the word and relies on the observation that, for certain classes of entities, some prefixes and suffixes are good indicators. For example, knowing that "Maria", "Marinela" and "Maricica" are feminine first names in Romanian, the same classification may be a good guess for "Mariana", based on the common prefix. Suffixes are typically even more informative: for example, "-escu" is an almost perfect indicator of a last name in Romanian, and the same applies to "-wski" in Polish, "-ovic" and "-ivic" in Serbo-Croatian, "-son" in English, etc.
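
The suffix evidence described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class name SuffixTable, the labels, the seed names other than those quoted above, and the interpolation weight parent_weight are assumptions made for illustration, and a flat dictionary keyed by suffix stands in for the trie nodes. Each suffix keeps per-class counts from the seed list, and the estimate for a new word is smoothed from the shortest suffix towards the longest one seen among the seeds.

```python
class SuffixTable:
    """Per-class counts for every suffix of the seed names.

    A flat dict keyed by suffix plays the role of the trie nodes
    (the empty suffix "" corresponds to the root).
    """

    def __init__(self, classes, parent_weight=0.5):
        self.classes = list(classes)
        # Assumed smoothing weight given to the shorter-suffix estimate;
        # the value 0.5 is illustrative, not taken from the paper.
        self.parent_weight = parent_weight
        self.counts = {}  # suffix -> {class label -> count}

    def add(self, word, label):
        word = word.lower()
        # Count the label at every suffix, including the empty one.
        for i in range(len(word) + 1):
            node = self.counts.setdefault(word[i:], {})
            node[label] = node.get(label, 0.0) + 1.0

    def _local(self, suffix):
        node = self.counts.get(suffix)
        if not node:
            return None
        total = sum(node.values())
        return {c: node.get(c, 0.0) / total for c in self.classes}

    def distribution(self, word):
        word = word.lower()
        uniform = 1.0 / len(self.classes)
        dist = {c: uniform for c in self.classes}
        w = self.parent_weight
        # Walk from the root (empty suffix) to longer and longer suffixes,
        # interpolating each node's estimate with the smoothed parent estimate.
        for i in range(len(word), -1, -1):
            local = self._local(word[i:])
            if local is None:
                break  # an unseen suffix implies all longer ones are unseen too
            dist = {c: w * dist[c] + (1.0 - w) * local[c] for c in self.classes}
        return dist


# Hypothetical seed list (a real one would hold about a hundred names per class).
table = SuffixTable(["first_name", "last_name"])
for name in ["Maria", "Marinela", "Maricica"]:
    table.add(name, "first_name")
for name in ["Popescu", "Ionescu", "Georgescu"]:
    table.add(name, "last_name")

escu_guess = table.distribution("Vasilescu")
ana_guess = table.distribution("Mariana")
print(escu_guess)
print(ana_guess)
```

With these seeds, the unseen name "Vasilescu" is pulled strongly towards last_name by the shared "-escu" suffix, while "Mariana" leans towards first_name only weakly, because just the final "-a" is shared. A prefix model can be sketched the same way by indexing prefixes instead of suffixes.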
Such morphological information is automatically learned during bootstrapping. Contextual patterns (e.g. "Mr.", "in" and "mayor of" in left context) are also clearly crucial to named entity identification and classification, especially for names that do not follow a typical morphological pattern for their word class, are of foreign origin or polysemous (for example, many places or
Similar resources
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To get good performanc...
Using Language Independent and Language Specific Features to Enhance Arabic Named Entity Recognition
The Named entity recognition task has been garnering significant attention as it has been shown to help improve the performance of many natural language processing applications. More recently, we are starting to see a surge in developing named entity recognition systems for languages other than English. With the relative abundance of resources for the Arabic language and a certain degree of mat...
Arabic Named Entity Recognition: An SVM-based Approach
The Named Entity Recognition (NER) task has been garnering significant attention as it has been shown to help improve the performance of many Natural Language Processing (NLP) applications. More recently, we are starting to see a surge in developing NER systems for languages other than English. With the relative abundance of resources for the Arabic language and a certain degree of maturation i...
Language Independent NER using a Unified Model of Internal and Contextual Evidence
This paper investigates the use of a language independent model for named entity recognition based on iterative learning in a co-training fashion, using word-internal and contextual information as independent evidence sources. Its bootstrapping process begins with only seed entities and seed contexts extracted from the provided annotated corpus. F-measure exceeds 77 in Spanish and 72 in Dutch.
Improving Persian Named Entity Recognition Using the Kasreh Ezafe
Named entity recognition is a process in which people's names, names of places (cities, countries, seas, etc.), organizations (public and private companies, international institutions, etc.), dates, currencies and percentages in a text are identified. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...